Group 3 - Airbnb Insights in Manhattan, NY

Chiahui Chen (Abby), Kwun Wah Michael Chiu, Yiqing Huang, Yan (Ava) Zhang

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. This dataset describes listing activity and metrics in Manhattan, NY for October 2020 and contains 20K+ records.

Data Source: http://insideairbnb.com/get-the-data.html

Outline

  1. Sentiment analysis on reviews
  2. Price Prediction
  3. Superhost vs Regular Host

Data Clean-Up

1. Review Data for Sentiment Analysis

The review data set has:

2. Listing Data

The listing data set has:

Price

NaN Checking

Bathroom Type and Number

The Rest of Variables

Categorical Variable Preprocessing - Property_type

Categorical Variable Preprocessing - Amenities Preprocessing

Data for finding 2 - Price prediction

Data for finding 3 - Superhost

Finding 1: Sentiment Analysis (Review)

1. Sentiment Analysis - Data Exploration

2. Sentiment Analysis - Part 1. Positive

3. Sentiment Analysis - Part 2. Negative

4. Sentiment Analysis - Part 3. Sentiment Score

5. Sentiment Analysis - Part 4. Neighbourhood

6. Sentiment Analysis - Part 5. Property Type

Conclusion:

Findings: Overall, the reviews of Manhattan listings are positive in sentiment.

Managerial insights: These results can help Airbnb hosts improve their customers’ experience. For example, a shared room in a tent has the most positive results among all property types, which suggests that people enjoy novel experiences. East Harlem has the lowest sentiment score, so hosts in that area could work on their services to boost their positive reviews.

Finding 2: Price Prediction

We explore how listing prices vary with house features and with external data sources, such as the locations of NYC transit stations and popular attractions, and use regression models to predict price.

1. Price Prediction - Data Exploration

a. Target Variable - Price

In this part, we use log_price as the dependent variable in the ML process because the distribution of price is extremely skewed.
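As a rough illustration of why the log transform helps, the sketch below compares the skewness of a made-up price list before and after taking the natural log. The prices are hypothetical, and it is an assumption that log_price is the plain natural log of price (a variant such as log(1 + price) would behave similarly).

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5

# Hypothetical nightly prices in USD; one luxury listing creates a long right tail.
prices = [60, 75, 90, 110, 120, 150, 180, 250, 400, 2500]

# Natural log of price; this assumes every price is strictly positive.
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness of price:     {raw_skew:.2f}")
print(f"skewness of log_price: {log_skew:.2f}")
```

On data like this, the log-transformed values are noticeably less skewed, which makes a single extreme listing less dominant when fitting a regression.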

b. Labeling Neighbourhood_cleansed

We use List_of_Neighborhood.csv to assign each neighborhood an area label (Downtown, Islands, Midtown, Midtown and Downtown, Uptown).

Data Source: https://en.wikipedia.org/wiki/List_of_Manhattan_neighborhoods

This is a list of neighborhoods in the New York City borough of Manhattan arranged geographically from the north of the island to the south.
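The labeling step can be sketched as a left merge between the listings and the neighborhood lookup table. The column name `area` and the sample rows are assumptions for illustration; the actual layout of List_of_Neighborhood.csv is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for List_of_Neighborhood.csv; real column names may differ.
neighborhoods = pd.DataFrame({
    "neighbourhood_cleansed": ["Harlem", "Hell's Kitchen", "SoHo", "Roosevelt Island"],
    "area": ["Uptown", "Midtown", "Downtown", "Islands"],
})

# Hypothetical listings; "Nowhere" deliberately has no match in the lookup table.
listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "neighbourhood_cleansed": ["SoHo", "Harlem", "Nowhere", "Hell's Kitchen"],
})

# A left merge keeps every listing; unmatched neighborhoods get NaN in "area".
labeled = listings.merge(neighborhoods, on="neighbourhood_cleansed", how="left")

# Surface unmatched names so the lookup table can be corrected by hand.
unmatched = labeled.loc[labeled["area"].isna(), "neighbourhood_cleansed"].tolist()
print(labeled)
print("unmatched:", unmatched)
```

Checking for unmatched names is worthwhile because neighborhood spellings in the Airbnb data and in the Wikipedia list may not agree exactly.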

c. Property Type & Room Type

2. Price Prediction - Adding New Variables

We use NYC_Subway.csv to obtain the locations of NYC subway stations, then compute each listing's distance to the stations and the number of stations within 500 m of the listing.

Data Source: https://data.ny.gov/Transportation/NYC-Transit-Subway-Entrance-And-Exit-Data/i9wp-a4ja
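One way to derive these distance features is the haversine formula, sketched below. The coordinates are approximate and purely illustrative (not taken from NYC_Subway.csv), the helper names are hypothetical, and the choice of haversine over another distance method is an assumption.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def subway_features(listing, stations, radius_m=500):
    """Return (distance to the nearest station, number of stations within radius_m)."""
    distances = [haversine_m(listing[0], listing[1], lat, lon) for lat, lon in stations]
    return min(distances), sum(d <= radius_m for d in distances)

# Hypothetical (lat, lon) pairs for three stations and one listing.
stations = [(40.7553, -73.9870), (40.7527, -73.9772), (40.8075, -73.9455)]
listing = (40.7580, -73.9855)

nearest_m, n_within_500m = subway_features(listing, stations)
print(f"nearest station: {nearest_m:.0f} m, stations within 500 m: {n_within_500m}")
```

The same helper could be reused with attraction coordinates to count nearby attractions.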

We use NYC_Top20.csv to count the attractions near each listing.

Data Source: https://www.timeout.com/newyork/attractions/top-attractions-in-manhattan

3. Price Prediction - Get Dummies of Categorical Variables

4. Price Prediction - Machine Learning

a. Feature Scaling

b. Split Data into Train and Test set

c. Grid Search - Finding the parameters for RandomForestRegressor
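A minimal sketch of this step with scikit-learn's GridSearchCV follows. The data is synthetic, and the parameter grid, fold count, and scoring choice are assumptions; the values actually searched in this analysis are not stated.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the listing features and the log_price target.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical grid; kept small so the search runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # MAE, negated so that higher is better
    cv=3,
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated MAE:", -search.best_score_)
print("held-out R^2:", search.best_estimator_.score(X_test, y_test))
```

Scoring on negated MAE keeps the search aligned with the MAE metric used to report the final model.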

d. RandomForestRegressor

e. Feature Importance

f. Compare our model (RandomForestRegressor) with others

Conclusion:

Findings:

RandomForestRegressor tuned by Grid Search performs best among all models, with MAE = 0.318 on the log scale, or $79 in USD. According to the feature importance results, the top 10 contributing variables relate to private space, bathroom type and count, bedroom type, and listing location, whereas the other property types, neighborhood area, and review scores contribute the least. Another distinctive finding, obtained by combining Airbnb data with NYC transit and attraction data, is that listing price is highly correlated with the number of subway stations and attractions near the listing.
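The two MAE figures are on different scales, and the sketch below shows one way they could be related: compute MAE on the log values, then back-transform both actuals and predictions with exp() and compute MAE again in dollars. The sample prices are hypothetical, and it is an assumption that the $79 figure was obtained this way.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and predicted log prices for five listings.
log_actual = [math.log(p) for p in [80, 120, 150, 300, 650]]
log_pred = [math.log(p) for p in [100, 110, 190, 240, 450]]

mae_log = mean_absolute_error(log_actual, log_pred)

# Back-transform with exp() to express the same errors in dollars.
usd_actual = [math.exp(v) for v in log_actual]
usd_pred = [math.exp(v) for v in log_pred]
mae_usd = mean_absolute_error(usd_actual, usd_pred)

print(f"MAE on log scale: {mae_log:.3f}")
print(f"MAE in USD:       {mae_usd:.2f}")
```

The dollar MAE cannot be derived from the log MAE alone, because the same log error means a larger dollar error on an expensive listing than on a cheap one.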

Managerial insight:

Our regression model can support the business by predicting a benchmark listing price, and it can further serve as the prototype of a pricing recommendation system for new hosts. In addition, the feature importance results indicate where the model can be improved in future research and development. We believe this work can provide business insights for Airbnb operations in real business settings.

Finding 3: Super Host

1. Super Host - Data Exploration

a. Overview of Super Host in Manhattan

  1. Key Metrics Overview
  2. Proportion: Superhosts make up 14% of all hosts in Manhattan

b. Ratings Discovery

  1. Overall Rating: review_scores_rating=96 is a good threshold to separate Super Hosts from Regular Hosts
  2. Quality Rating: what do guests value most when rating?
  3. Occupancy Rating: differences in the number of reviews over the last 12 months
  4. Service Rate: host response rate comparison

c. Price

d. Map

Conclusion:

Findings:

“Superhost” status correlates most with amenities, followed by cleanliness and value. Communication and check-in experience rank 4th and 5th. Surprisingly, superhosts rated above 93 points charge on average $42 less per night than regular hosts.

Managerial insights:

We can help regular hosts improve in the areas mentioned above to raise the proportion of superhosts in Manhattan. To maximize profit, superhosts with higher scores can raise their prices step by step to test the market and to match the high quality of the house and the superior service they provide.

Extended findings on Superhost and Regular Host

Objective: To promote regular hosts to superhosts

1) In the geographic distribution of superhosts and regular hosts, the top 3 districts are Harlem, Hell’s Kitchen, and Midtown. District by district, the aim is to raise the share of regular hosts promoted to superhost until it meets the 14% benchmark, which would help the middle-ranking districts and even the small districts that are left out.

Findings

1) In general, superhosts outperform regular hosts in price (and also price value), response rate, acceptance rate, review volume, and review scores. However, the location scores are very close, meaning there is a biased concentration of superhosts.

Managerial Insight

Methodology:

1) We take a 5% discount off the superhost average for response rate and for the review scores for rating, cleanliness, value, communication, and check-in, and look at the concentration of simulated superhosts and regular hosts in each district. We then take the districts that hit the benchmark on every criterion:

2) By looking at the review score table (in green) generated above, we reach the conclusion below.
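The benchmark in step 1 of the methodology can be sketched as follows. This is one reading of the method, not a confirmed implementation: the benchmark is taken as 95% of the superhost average per metric, a regular host who meets every benchmark counts as a "simulated superhost", and the host records, field names, and three-metric subset are all hypothetical.

```python
from collections import defaultdict

METRICS = ["response_rate", "rating", "cleanliness"]  # subset of the criteria, for brevity

# Hypothetical host records; values are illustrative only.
hosts = [
    {"district": "Harlem", "superhost": True, "response_rate": 100, "rating": 98, "cleanliness": 10},
    {"district": "Harlem", "superhost": False, "response_rate": 97, "rating": 95, "cleanliness": 10},
    {"district": "Nolita", "superhost": True, "response_rate": 98, "rating": 96, "cleanliness": 10},
    {"district": "Nolita", "superhost": False, "response_rate": 70, "rating": 88, "cleanliness": 8},
    {"district": "Nolita", "superhost": False, "response_rate": 99, "rating": 90, "cleanliness": 9},
]

# Benchmark per metric: the superhost average with a 5% discount applied.
supers = [h for h in hosts if h["superhost"]]
benchmark = {m: 0.95 * sum(h[m] for h in supers) / len(supers) for m in METRICS}

def meets_benchmark(host):
    """True if the host reaches the benchmark on every metric."""
    return all(host[m] >= benchmark[m] for m in METRICS)

# district -> [simulated superhosts, regular hosts]
counts = defaultdict(lambda: [0, 0])
for h in hosts:
    if not h["superhost"]:
        counts[h["district"]][1] += 1
        counts[h["district"]][0] += int(meets_benchmark(h))

# Share of regular hosts in each district that already perform near superhost level.
share = {d: sim / total for d, (sim, total) in counts.items()}
print(benchmark)
print(share)
```

Districts with a high share have many regular hosts close to superhost level, while districts with a low share are the ones where support would be needed first.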

Conclusion:

1) Boost the ratio of superhosts in the following areas:

For the middle stream: Lower East Side, Chinatown, Financial District

For the lower stream: Civic Center, Little Italy, Marble Hill, Nolita

2) This finding also echoes the sentiment analysis, in which the high sentiment scores are mostly concentrated in the lower part of Manhattan (as highlighted in yellow on the map).
